Abstract

Efficient Human Epithelial-2 (HEp-2) cell image classification can facilitate
the diagnosis of many autoimmune diseases. This paper presents an automatic
framework for this classification task, by utilizing the deep convolutional
neural networks (CNNs) which have recently attracted intensive attention in
visual recognition. This paper elaborates the important components of this
framework, discusses multiple key factors that impact the efficiency of
training a deep CNN, and systematically compares this framework with the
well-established image classification models in the literature. Experiments
on benchmark datasets show that i) the proposed framework can effectively
outperform existing models by properly applying data augmentation; ii) our
CNN-based framework demonstrates excellent adaptability across different
datasets, which is highly desirable for classification under varying laboratory
settings. Our system is ranked high in the cell image classification
competition hosted by ICPR 2014.

Keywords: Indirect immunofluorescence, staining patterns classification, 
deep convolutional neural networks, data augmentation 


1. Introduction 


Indirect immunofluorescence (IIF) on Human Epithelial-2 (HEp-2) cells is
a recommended methodology to diagnose autoimmune diseases (Rigon et al.,
2007). However, manual analysis of IIF images leads to crucial limitations,
such as the subjectivity of results, the inconsistency across laboratories,
and the low efficiency in processing a large number of cell images
(Meroni and Schur, 2010; Foggia and Vento, 2013). To improve this situation,
automatic and reliable cell image classification has become an active
research topic.

Many methods have been recently proposed for this topic, especially
during the HEp-2 cell classification competitions (Foggia and Vento, 2013;
Foggia et al., 2014; Lovell et al., 2014). Most of them treat feature
extraction and classification as two separate stages. For the former, a variety
of hand-crafted features are adopted, including local binary pattern (LBP)
(He and Wang, 1990; Nosaka and Fukui, 2014; Theodorakopoulos et al., 2014b),
scale-invariant feature transform (SIFT) (Lowe, 2004), histogram of oriented
gradients (Dalal and Triggs, 2005), discrete cosine transform, and statistical
features such as the gray-level co-occurrence matrix (Haralick et al., 1973) and
the gray-level size zone matrix (Thibault et al., 2014). For the latter,
nearest-neighbor classifier, boosting, support vector machines (SVM) and
multiple kernel SVM have been employed (Wiliem et al., 2014). As a result, the
performance of these classifiers relies highly on the appropriateness of the
empirically chosen hand-crafted features. Moreover, because features and
classifier are treated separately, they cannot work together to maximally
identify and retain discriminative information.

Very recently, deep convolutional neural networks (CNNs) have consistently
achieved outstanding performance on generic visual recognition tasks
(Krizhevsky et al., 2012), and this has revived extensive research interest in
CNN-based classification models (Razavian et al., 2014). The CNNs consist of
multi-stage processing of an input image to extract hierarchical and high-level
feature representations. Many hand-crafted features and the corresponding
classification pipelines can be regarded as an approximation to or a special
case of the CNNs, by sharing some basic building blocks. Nevertheless, these
features and pipelines have to be carefully designed and integrated in order
to preserve discriminative information. The excellent performance achieved
by deep CNNs on generic visual recognition and the high demand for full
automation of HEp-2 cell image classification motivate us to research the
CNNs for this classification task.

To this end, we propose an automatic feature extraction and classification
framework for HEp-2 staining patterns based on deep CNNs (LeCun et al.,
1998). This framework extracts features from the raw pixels of cell images
and avoids using hand-crafted features. Feature representations for each kind
of staining pattern are learned and optimized via training the multi-layer
network. Also, the classification layer is jointly learned with this network to
predict the probability of a cell image for each class. The highly non-linear
and high-capacity properties (LeCun et al., 2012) make the multi-layer CNNs
difficult to train, especially when the number of training samples is not
sufficiently large. We explore multiple important aspects in this CNN-based
classification system, including network architecture, image preprocessing,
hyper-parameter selection, and data augmentation, which are important for
CNNs to achieve effective and reliable cell classification. Furthermore, we
conduct rigorous experimental comparison with two state-of-the-art
hand-designed shallower image representation models, i.e., bag-of-features
(BoF) and Fisher Vector (FV), to investigate the advantages and disadvantages
of our CNN-based framework on cell image classification. Our system has
participated in the Contest on Performance Evaluation on Indirect
Immunofluorescence Image Analysis Systems hosted by ICPR 2014¹ and won the
fourth place among 11 international teams.

The rest of the paper is organized as follows. Section 2 reviews the
classification models of BoF, FV and deep CNNs. In Section 3, our CNN-based
framework for cell image classification is presented and a set of key factors
are discussed. Section 4 reports the experimental investigation and
comparison, and the conclusions are drawn in Section 5.

We were invited by the ICPR 2014 contest organizers to report our system
in a workshop short paper (Gao et al., 2014). This paper significantly
extends that workshop paper in the following aspects: i) a more detailed
description of our deep CNN-based classification framework for HEp-2 cell
images is presented and multiple key factors for effectively training a
reliable deep CNN are discussed and experimentally demonstrated; ii) the role
of image rotation as a data augmentation method in helping the deep CNN
to achieve robust representations in this classification task is investigated
and analyzed; iii) systematic experimental comparisons of our CNN-based
framework and the state-of-the-art hand-designed classification models are
conducted; iv) the excellent adaptability of our cell classification system with
respect to different laboratory settings is demonstrated by transferring the
learned network across two datasets with easy implementation, which makes
our system attractive for practical clinical applications.

1 Contest website is at http://i3a2014.unisa.it/?page_id=91.


2. Related Work 


2.1. Bag-of-features and Fisher Vector Models 

The BoF model (Csurka et al., 2004) generally consists of four stages:
local feature extraction, dictionary learning, feature encoding, and feature
pooling. The dictionary is composed of a set of visual words describing
the common visual patterns shared by local descriptors. The relationship
between local descriptors and visual words is characterized by feature
encoding (Liu et al., 2011; Wang et al., 2010; Jegou et al., 2010; Boiman et
al., 2008). On top of these, spatial pyramid matching (SPM) (Lazebnik et al.,
2006) is usually utilized to incorporate the spatial information of an image.
The BoF model has been applied to staining patterns classification (Wiliem
et al., 2014), in which one or more of the above four stages are tailored to
obtain better cell image representations for classification. Readers are
referred to the review by Foggia et al. (2014) for more details.

In the past several years, the FV model has shown superior performance to the
BoF model (Perronnin and Dance, 2007; Perronnin et al., 2010; Sanchez et al.,
2013). Their main differences lie in dictionary learning and feature encoding.
The dictionary in FV is generated by a probabilistic model, e.g., the Gaussian
mixture model (GMM), that characterizes the distribution of local
descriptors. Each local descriptor is then encoded by the first- and
second-order gradients with respect to the model parameters. The FV model has
also been applied to cell image classification (Faraki et al., 2014; Han et
al., 2014).


2.2. Deep Convolutional Neural Networks 

CNNs belong to a class of learning models inspired by the multi-stage
processes of the visual cortex (Hubel and Wiesel, 1962). A pioneering work of
CNNs was Fukushima's "neocognitron" (Fukushima, 1980). It has a structure
similar to the hierarchical model of the visual nervous system discovered
by Hubel and Wiesel (Hubel and Wiesel, 1959). Each stage of the network
imitates the functions of simple and complex cells in the primary visual
cortex. Later on, LeCun et al. (1998) extended the neocognitron by utilizing
the backpropagation algorithm to train the model parameters of CNNs and
achieved excellent performance in hand-written digit recognition.

With the advent of fast parallel computing, better regularization
strategies, and large-scale datasets, deep CNN models have recently
significantly improved the performance of image classification, detection and
retrieval (Razavian et al., 2014), as well as other visual recognition tasks,
such as face verification (Taigman et al., 2014) and mitosis detection in
breast cancer histopathology images (Veta et al., 2015). As for cell image
classification, Malon et al. (Foggia and Vento, 2013) adopted a CNN to classify
HEp-2 cell images. Buyssens et al. (2013) designed a multiscale CNN for
cytological pleural cancer cell classification. Our CNN framework presented in
this paper is different from their works in terms of both image preprocessing
method and network architecture. Moreover, our CNN performs better than the
CNN reported in Foggia and Vento (2013) on ICPR 2012 HEp-2 cell
classification.

Although CNNs have been initially applied to cell image classification,
the following issues have not been systematically investigated and thus
remain unclear: i) what are the key issues when adopting deep CNNs for cell
classification? ii) how does the CNN-based classification model perform
when compared with the well-established classification models in the
literature, especially the BoF and FV models? These issues will be carefully
investigated and addressed in this work.


3. Proposed Framework 

The proposed deep CNN-based HEp-2 cell image classification framework 
consists of three components: image preprocessing, network training, and 
feature extraction and classification, which are elaborated in this section. 
Also, data augmentation which plays an important role in this classification 
framework will be described and analyzed. 

3.1. Network Architecture 

A proper selection of network architecture is crucial to CNNs. Usually,
deep CNNs are composed of multiple convolutional layers interlaced with
subsampling (pooling) layers, as shown in Fig. 1. Each layer outputs a set
of two-dimensional feature maps, each of which represents a specific feature
detected from all positions of the input. These feature maps are in turn used
as the input of the next layer. Fully-connected layers are usually stacked on
the top of the network to conduct classification.

Figure 1: The architecture of our deep convolutional neural network classification system
for HEp-2 cell images. Each plane within the feature extraction stage denotes a feature
map. The convolutional layer and max-pooling layer are abbreviated as C and P
respectively. C1:6@72 x 72 means that this is a convolutional layer, and is the first layer of
the network. This layer is comprised of six feature maps, each of which has a size of 72 x 72.
The symbols and numbers above the feature maps of other layers have similar meanings,
whereas F7:150 means that this is a fully-connected layer. It is the seventh layer of the
network and has 150 neurons. The words and numbers between two layers stand for the
operation, i.e., convolution or max-pooling, applied to the feature maps of the previous
layer in order to obtain the feature maps of this layer, and the size of each filter or
pooling region.


Our deep CNN shares the same basic architecture as the classical LeNet-5
(LeCun et al., 1998). Specifically, it contains eight layers. Among them,
the first six layers are convolutional layers alternated with pooling layers,
and the remaining two are fully-connected layers for classification.

3.1.1. Convolutional Layer 

Let us assume that it is the $l$th layer. Let $N^l$ denote the number of
feature maps at this layer, where $l$ is used as a superscript. Accordingly,
each feature map is denoted as $\mathbf{h}_j^l$ ($j = 1, 2, \ldots, N^l$). This
convolutional layer is parametrized by an array of two-dimensional filters
$\mathbf{W}_{ij}^l$ associating the $i$th feature map $\mathbf{h}_i^{l-1}$ in
the $(l-1)$th layer with the $j$th feature map in the $l$th layer, and the bias
$b_j^l$. Each filter acts as a feature detector to detect one particular kind
of feature by convolving with every location of the input feature map.

To obtain $\mathbf{h}_j^l$, each input feature map $\mathbf{h}_i^{l-1}$
($i = 1, 2, \ldots, N^{l-1}$) is firstly convolved with the corresponding
filter $\mathbf{W}_{ij}^l$. The results are summed and appended with the bias
$b_j^l$. After that, a non-linear activation function $\phi(\cdot)$, which can
be sigmoid, tanh or rectified linear function (Krizhevsky et al., 2012), is
applied in an element-wise manner. Mathematically, the feature maps of the
$l$th layer can be expressed as follows:

$$ \mathbf{h}_j^l = \phi\Big( \sum_{i=1}^{N^{l-1}} \mathbf{h}_i^{l-1} * \mathbf{W}_{ij}^l + b_j^l \Big), \quad j = 1, 2, \ldots, N^l, \tag{1} $$

where $*$ denotes the convolution operation.
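
As a rough illustration of Eq. (1), the following NumPy sketch computes the
feature maps of one convolutional layer. It is a minimal sketch under stated
assumptions rather than the implementation used in this work: border handling
is "valid" (no padding), every input map is connected to every output map,
plain tanh is the default activation, and the names conv2d_valid and
conv_layer are hypothetical.

```python
import numpy as np

def conv2d_valid(x, w):
    """'Valid' 2-D convolution of one feature map x with one filter w."""
    kh, kw = w.shape
    wf = w[::-1, ::-1]  # flip the kernel: true convolution, not correlation
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * wf)
    return out

def conv_layer(h_prev, W, b, phi=np.tanh):
    """Eq. (1). h_prev: (N_prev, H, W); W: (N_prev, N_cur, k, k); b: (N_cur,)."""
    n_prev, n_cur = W.shape[0], W.shape[1]
    maps = []
    for j in range(n_cur):
        # Sum the convolutions over all input maps, add the bias, apply phi.
        s = sum(conv2d_valid(h_prev[i], W[i, j]) for i in range(n_prev))
        maps.append(phi(s + b[j]))
    return np.stack(maps)
```

With a single 78 x 78 input map and six 7 x 7 filters, conv_layer returns six
maps of size 72 x 72, which agrees with the C1:6@72 x 72 label in Fig. 1.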


3.1.2. Pooling Layer 

A pooling layer down-samples a feature map. This greatly reduces the
computation of training a CNN and also introduces invariance to small
translations of input images. Max-pooling or average-pooling is usually applied.
The former selects the maximum activation over a small pooling region, while
the latter uses the average activation over this region. Max-pooling generally
performs better than average-pooling (Boureau et al., 2010).
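
The two pooling variants can be sketched in a few lines of NumPy. This is an
illustrative sketch only: it assumes non-overlapping square regions, crops any
rows or columns that do not fill a complete region (an assumption, not stated
in the text), and the function name pool2d is hypothetical.

```python
import numpy as np

def pool2d(x, p, mode="max"):
    """Pool a 2-D feature map over non-overlapping p x p regions."""
    h, w = x.shape[0] // p, x.shape[1] // p
    # Crop any remainder, then expose each p x p region as its own pair of axes.
    blocks = x[:h * p, :w * p].reshape(h, p, w, p)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```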


3.1.3. Classification Layer 

Classification layers usually involve one or more fully-connected layers at
the top of a CNN. Our network contains two fully-connected layers. The
first fully-connected layer (F7 in Fig. 1) takes the cascade of all the feature
maps of the sixth layer (denoted as $\mathbf{h}^6$) as input. This layer is
parametrized by weights $\mathbf{W}^7$ and biases $\mathbf{b}^7$. The output of
this layer $\mathbf{h}^7$ is obtained as
$\mathbf{h}^7 = \phi(\mathbf{W}^7 \mathbf{h}^6 + \mathbf{b}^7)$. The last
fully-connected layer is the output layer and is parametrized by weights
$\mathbf{W}^8$ and biases $\mathbf{b}^8$. It contains $n$ neurons corresponding
to $n$ classes of staining patterns, and outputs the probabilities
$\mathbf{y} = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^n$ via softmax
regression as follows:

$$ \mathbf{h}^8 = \mathbf{W}^8 \mathbf{h}^7 + \mathbf{b}^8, \quad \mathbf{h}^8 \in \mathbb{R}^n, \tag{2} $$

$$ y_j = \frac{\exp(h_j^8)}{\sum_{i=1}^{n} \exp(h_i^8)}, \quad j = 1, 2, \ldots, n, \tag{3} $$

where $y_j$ is the output probability of the $j$th neuron.
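
Eqs. (2) and (3) can be sketched as follows. This is a minimal NumPy sketch,
not the original implementation: the function names are hypothetical, the
activation is the scaled hyperbolic tangent of LeCun et al. (1998), and
subtracting the maximum before exponentiation is a standard numerical
safeguard that leaves the result of Eq. (3) unchanged.

```python
import numpy as np

def scaled_tanh(x):
    """Scaled hyperbolic tangent, 1.7159 * tanh(2x/3)."""
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def classify(h6, W7, b7, W8, b8, phi=scaled_tanh):
    """Map the flattened sixth-layer features h6 to class probabilities."""
    h7 = phi(W7 @ h6 + b7)       # first fully-connected layer (F7)
    h8 = W8 @ h7 + b8            # Eq. (2)
    e = np.exp(h8 - h8.max())    # shift by the maximum for numerical stability
    return e / e.sum()           # Eq. (3)
```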

The network architecture of our deep CNN is illustrated in Fig. 1.
Specifically, the first layer convolves an input image with each of the six
filters of size 7 x 7 with a stride of one pixel, and then adds a bias to each
of them after convolution. We adopt the hyperbolic tangent function
$\phi(x) = 1.7159 \tanh(\frac{2}{3}x)$ (LeCun et al., 1998) as the activation
function. The second layer takes the output of the first layer as input, and
applies max-pooling over non-overlapping regions of size 2 x 2 for each feature
map. The third layer adopts filters of size 4 x 4, and has 16 feature maps. The
fourth layer then applies max-pooling over non-overlapping pooling regions of
size 3 x 3. The fifth layer employs filters of size 3 x 3 and includes 32
feature maps. The sixth layer applies 3 x 3 non-overlapping max-pooling to the
output maps of the fifth layer. After that, the resulting 32 feature maps of
size 3 x 3 are cascaded and passed to the first fully-connected layer
containing 150 neurons.
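
The layer sizes above can be checked with a short calculation. The sketch below
traces the spatial size of the feature maps and counts trainable parameters,
assuming a 78 x 78 single-channel input (Section 3.2), six output classes,
"valid" convolutions with stride one, and full connectivity between the maps of
consecutive convolutional layers. The connectivity is an assumption; the
function name trace_architecture is hypothetical.

```python
def trace_architecture(input_size=78, n_classes=6):
    """Return the map size after each of the six layers and the parameter count."""
    # (layer type, filter or pooling size, number of output maps)
    layers = [("conv", 7, 6), ("pool", 2, 6), ("conv", 4, 16),
              ("pool", 3, 16), ("conv", 3, 32), ("pool", 3, 32)]
    size, maps, params, sizes = input_size, 1, 0, []
    for kind, k, out_maps in layers:
        if kind == "conv":
            params += maps * out_maps * k * k + out_maps  # filters + biases
            size = size - k + 1                           # valid convolution
        else:
            size = size // k                              # non-overlapping pooling
        maps = out_maps
        sizes.append(size)
    params += maps * size * size * 150 + 150              # F7: 150 neurons
    params += 150 * n_classes + n_classes                 # output layer
    return sizes, params

sizes, params = trace_architecture()
print(sizes)   # [72, 36, 33, 11, 9, 3]
print(params)  # 50748
```

Under these assumptions the map sizes end at 3 x 3, as described above, and
the total of 50,748 parameters is consistent with the figure of "over 50,000"
quoted in Section 3.4.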

When a cell image is fed into the network, the spatial resolution of each
feature map decreases as the features are extracted hierarchically from one
layer to the next. The spatial information of each cell is captured by the
feature maps because of the spatial convolution and pooling operations, which
is important for distinguishing different staining pattern types. The features
obtained are invariant to small translations or shifts of cell images, because
the filter weights of the convolutional layers are shared across different
regions of the input maps and max-pooling is robust to small variations.


3.2. Image Preprocessing 

An appropriate image preprocessing method that takes the characteristics
of images into consideration is necessary for deep CNNs to obtain good
internal feature representation and classification performance.

The brightness and contrast of the HEp-2 cell images provided by the
ICPR 2014 contest (ICPR2014 dataset in short) vary greatly. To reduce this
variance and enhance the contrast, we normalize each image by first
subtracting the minimum intensity value of the image. The resulting intensity
is then divided by the difference between the maximum and minimum
intensity values. Furthermore, each image is resized to 78 x 78 to guarantee a
uniform scale of all the images used for training. This size is approximately
the average size of all the cell images. Examples of six staining patterns
in ICPR2014 dataset and the preprocessed images are shown in Fig. 2. In
addition, we use the preprocessed whole cell images to train our network
instead of adopting a mask to keep only the foreground within each cell, as
done by Malon et al. in Foggia and Vento (2013), because the mask information
of each cell is usually unavailable in practice, and we find that the
classification performance of our system is adversely affected by using cell
masks.
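
The contrast normalization and resizing steps can be sketched as follows. This
is an illustrative NumPy sketch, not the original implementation: the
interpolation method used for resizing is not specified in the text, so
nearest-neighbor sampling is assumed here, and the guard for constant images
and the function names are likewise assumptions.

```python
import numpy as np

def normalize_contrast(img):
    """Min-max normalization: (I - min) / (max - min), mapping to [0, 1]."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                      # constant image: avoid division by zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)

def resize_nearest(img, size=78):
    """Resize a 2-D image to size x size by nearest-neighbor sampling."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]

def preprocess(img, size=78):
    return resize_nearest(normalize_contrast(img), size)
```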


3.3. Data Augmentation 

Deep CNNs are high-capacity architectures having a large number of
parameters to be learned. It will be difficult to effectively train a CNN when
training images are insufficient. Data augmentation (Krizhevsky et al., 2012)
has been regarded as a simple and effective way to generate more samples to
train a CNN and gain robustness against a variety of variances.

Figure 2: Example cells of six classes in ICPR2014 dataset and their corresponding
preprocessed and aligned images. There are four images for each cell: (a) the original image;
(b) the mask of this cell image (we do not take advantage of it for training the CNN);
(c) the preprocessed image obtained by applying contrast normalization and resizing to the
original image; (d) the aligned image obtained by aligning the contrast normalized image
by PCA and then resizing it.

For data augmentation in the cell image classification, we identify the 
following two points: i) generating new training images by rotating existing 
ones can effectively boost the classification performance of the CNNs; ii) 
instead of merely increasing the robustness of the CNNs against the global 
orientation of a cell, the extra samples generated via such rotation-based 
augmentation help to show the intrinsic distribution of the staining patterns 
belonging to each cell category, which is a more important factor contributing 
to the improvement of the classification performance. 

To demonstrate the first point, we keep rotating each training image
with respect to its center by a step of $\theta$ degrees. The newly generated
images inherit the class label from the original training image, because
rotating a cell image does not change its class label. By doing so, the
original training set is enlarged by a factor of $m = 360/\theta$, and this
augmented training set is used to train the CNN.
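
The rotation-based augmentation can be sketched as follows. This is a minimal
NumPy sketch under stated assumptions: rotation uses nearest-neighbor sampling
with zero fill outside the image (the interpolation actually used is not
given in the text), $\theta$ is assumed to divide 360, and the function names
are hypothetical.

```python
import numpy as np

def rotate_about_center(img, angle_deg):
    """Rotate a 2-D image about its center (nearest-neighbor, zero fill)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, find its source location.
    src_x = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    src_y = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    sx, sy = np.rint(src_x).astype(int), np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def augment_by_rotation(images, labels, theta):
    """Return m = 360/theta rotated copies of each image with inherited labels."""
    m = int(round(360 / theta))
    aug_x, aug_y = [], []
    for img, lab in zip(images, labels):
        for k in range(m):
            aug_x.append(rotate_about_center(img, k * theta))
            aug_y.append(lab)  # rotation does not change the class label
    return np.stack(aug_x), np.array(aug_y)
```

The copy at k = 0 is the unrotated image itself, so the augmented set contains
the original training set.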

To demonstrate the second point, we pre-align each cell image to
approximately have the same global orientation. In this way, if the global
orientation variance is really the main factor affecting the training
performance of the CNN, we shall observe some improvement by using the
pre-aligned training set. Also, augmenting this pre-aligned training set with
rotated images shall not lead to significantly better classification
performance.

To investigate our hypothesis, we apply principal component analysis
(PCA) to each cell's mask to obtain the principal direction of its shape.
Each contrast normalized cell is rotated to make this principal direction
vertical and is then resized. Applying this process to all training cell
images makes them pre-aligned. These operations are illustrated in the upper
left portion (as indicated) of Fig. 2, followed by more examples of cell images
before and after alignment. After that, we use the pre-aligned training images
to train the CNN and then classify test images which are also pre-aligned.
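
The principal direction of a cell mask can be estimated as sketched below. This
is an illustrative NumPy sketch, not the original implementation: it treats the
foreground pixel coordinates as two-dimensional samples, takes the eigenvector
of their covariance matrix with the largest eigenvalue, and reports its
deviation from the vertical image axis. The function name and the sign
convention of the returned angle are assumptions.

```python
import numpy as np

def principal_axis_angle(mask):
    """Angle in degrees between the mask's principal axis and the vertical axis."""
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs - xs.mean(), ys - ys.mean()])  # 2 x P, centered
    cov = coords @ coords.T / coords.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
    vx, vy = eigvecs[:, -1]                              # largest-variance direction
    if vy < 0:                                           # resolve the sign ambiguity
        vx, vy = -vx, -vy
    return float(np.degrees(np.arctan2(vx, vy)))
```

Rotating the contrast normalized cell so that this deviation becomes zero makes
the principal direction vertical; whether that requires a rotation by the
returned angle or by its negative depends on the rotation routine's convention.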

We find that the CNN trained in this manner does not show better
performance than the CNN trained with the preprocessed training images without
alignment. However, when data augmentation is applied to the pre-aligned
training set images, the performance of the trained CNN increases greatly.
This indicates that, in terms of cell classification, adequately demonstrating
the staining patterns within a cell image is more important than removing the
global orientation variance². Detailed experimental results will be presented
in Section 4.


3.4. Network Training

Due to the non-convex property of the cost surface of CNNs, it is essential
to select appropriate network training parameters, e.g., learning rate, and
regularization methods, e.g., weight decay and dropout (Hinton et al., 2012),
to make the network converge to good solutions fast.

2 A good example in contrast is human facial image, for which pre-alignment is generally
helpful for recognition. This is because the patterns within a facial image, e.g., eyes, nose
and mouth, have a rigid geometric association with the global orientation of the face.
Pre-aligning the faces with respect to their global orientations effectively makes the patterns
inside align with each other. Nevertheless, it is not such a case for cell images.

Our deep CNN is parameterized by the weights and biases of different
convolutional layers and fully-connected layers $\{\mathbf{W}^l, \mathbf{b}^l\}$,
where $l = 1, 3, 5, 7, 8$. The total number of parameters is over 50,000. The
network is trained by minimizing the cross-entropy between the output
probability vector $\mathbf{y} = [y_1, y_2, \ldots, y_n]^T$ and the binary
class label vector $\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]^T$
with one non-zero entry "1" corresponding to the true class, which is expressed
as follows:

$$ E(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{n} \hat{y}_j \log(y_j). \tag{4} $$
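
Eq. (4) can be evaluated as in the following sketch. It is a minimal NumPy
sketch: the function name is hypothetical, and clipping the probabilities away
from zero is a numerical safeguard added here, not part of the definition.

```python
import numpy as np

def cross_entropy(y, y_label, eps=1e-12):
    """Cross-entropy between output probabilities y and a one-hot label vector."""
    y = np.clip(np.asarray(y, dtype=np.float64), eps, 1.0)  # avoid log(0)
    return float(-np.sum(np.asarray(y_label) * np.log(y)))
```

Because the label vector has a single non-zero entry, the sum reduces to the
negative log-probability assigned to the true class.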


The weights are initialized from a uniform distribution and the biases are
initialized to zero. All these trainable parameters are updated periodically
via stochastic gradient descent (SGD) (LeCun et al., 1998) after evaluating
the cost function. Let $w^l$ denote a weight of the $l$th layer, i.e., an
element of $\mathbf{W}^l$. Let $b^l$ be a bias of the $l$th layer (an element
of $\mathbf{b}^l$). Each weight $w^l$ and bias $b^l$ are updated by the
following rules:

$$ w^l := w^l - \eta \cdot \frac{\partial E}{\partial w^l}, \qquad b^l := b^l - \eta \cdot \frac{\partial E}{\partial b^l}, \tag{5} $$

where $\eta$ is the learning rate, and $\frac{\partial E}{\partial w^l}$ and
$\frac{\partial E}{\partial b^l}$ are the partial derivatives of the cost
function with respect to $w^l$ and $b^l$ respectively. They are calculated and
updated via back-propagating the output error to the $l$th layer (LeCun et al.,
1989) after a number of training images (a mini-batch (Bengio, 2012)) are fed
into the network.

To smooth the directions of gradient descent and make the network
converge fast, we employ momentum (Bengio, 2012) to speed up the learning
by guiding the descent direction with past gradients. The update rules of $w^l$
and $b^l$ become as follows:

$$ v_w^l := \alpha \cdot v_w^l - \beta \cdot \eta \cdot w^l - \eta \cdot \frac{\partial E}{\partial w^l}, \qquad w^l := w^l + v_w^l, $$

$$ v_b^l := \alpha \cdot v_b^l - \eta \cdot \frac{\partial E}{\partial b^l}, \qquad b^l := b^l + v_b^l, \tag{6} $$

where $v_w^l$ and $v_b^l$ are the momentum variables for $w^l$ and $b^l$
respectively; $\alpha$ and $\beta$ are the coefficients of the momentum term
and the weight decay term, and their optimal values are experimentally tuned,
as shown in Section 4. When the training error rate becomes stabilized, the
learning rate $\eta$ will be reduced to achieve finer learning. The whole
training process terminates after the classification error rates of both the
training set and the validation set (which is held out from the given training
images) plateau at some epochs.
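
One step of the update in Eq. (6) can be sketched as follows. This is a minimal
sketch: the function name is hypothetical, the default values are the tuned
hyper-parameters listed in Table 2, and the gradient is assumed to have been
computed already by back-propagation over a mini-batch.

```python
def momentum_step(w, v, grad, eta=0.01, alpha=0.9, beta=0.0005, decay=True):
    """One update of Eq. (6). Use decay=False for biases (no weight decay term)."""
    v = alpha * v - eta * grad          # momentum plus gradient step
    if decay:
        v = v - beta * eta * w          # weight decay term, weights only
    return w + v, v

# Example: a single weight with gradient 2.0, starting from zero momentum.
w, v = momentum_step(1.0, 0.0, 2.0)
```

The same function works element-wise on NumPy arrays, so a whole weight matrix
and its momentum buffer can be updated in one call.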

In addition, another newly developed regularization strategy, dropout
(Hinton et al., 2012), is also investigated in the network training. It
randomly sets a fraction of the activations in the hidden layers to zero to
force the hidden units to learn more independent and robust features that could
generalize well and to prevent overfitting.


3.5. Feature Extraction and Classification 

When classifying a test image, the same preprocessing and rotation described
in Sections 3.2 and 3.3 are applied. This results in $m$ rotated variants in
total. Each of them is forward-propagated through the network, and the
probability of this image for each of the $n$ classes is obtained. To further
improve the robustness of classification, we select four similar CNNs after the
training process becomes stable and use them collectively for classification,
following Krizhevsky et al. (2012). The predicted class is the one having the
maximum output probability averaged over the $4m$ probabilities, that is,

$$ l = \arg\max_j \bar{y}_j = \arg\max_j \frac{1}{4m} \sum_{i=1}^{m} \sum_{k=1}^{4} y_{ik}^{j}, \quad j = 1, 2, \ldots, n, \tag{7} $$

where $y_{ik}^{j}$ is the probability of class $j$ output by the $k$th CNN for
the $i$th rotated variant.
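
The averaging in Eq. (7) can be sketched as follows. This is a minimal NumPy
sketch: the function name is hypothetical, and the probabilities are assumed to
be stored in an array indexed by network, rotated variant and class.

```python
import numpy as np

def predict_class(probs):
    """probs: array of shape (num_cnns, m, n). Returns (class index, mean probs)."""
    probs = np.asarray(probs, dtype=np.float64)
    # Average over all networks and all rotated variants, per class.
    mean_probs = probs.reshape(-1, probs.shape[-1]).mean(axis=0)
    return int(np.argmax(mean_probs)), mean_probs
```

Note that the returned index is zero-based, whereas the classes in Eq. (7) are
numbered from 1 to $n$.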


4. Experimental Results 


We evaluate our CNN classification system on two datasets of the HEp-2
cell classification competitions held by ICPR 2014 and 2012. The evaluation
criterion is the mean class accuracy (MCA) newly adopted by the ICPR 2014
competition. It is the average of the per-class accuracies (Lovell et al., 2014)
defined as follows:

$$ \mathrm{MCA} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{CCR}_k, \tag{8} $$

where $\mathrm{CCR}_k$ is the classification accuracy of class $k$ and $n$ is
the number of cell classes.

The average classification accuracy (ACA), which is the overall correct 
classification rate of all the cell images, used by the previous competition is 
also calculated for the ease of comparison. 
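
The two criteria can be computed as in the following sketch. It is a minimal
NumPy sketch with a hypothetical function name; it assumes integer class labels
from 0 to n - 1 and that every class occurs at least once in the ground truth.

```python
import numpy as np

def mca_aca(y_true, y_pred, n_classes):
    """Return (MCA, ACA) for integer labels in the range 0 .. n_classes - 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Per-class accuracy CCR_k: fraction of class-k images predicted as k.
    ccr = [np.mean(y_pred[y_true == k] == k) for k in range(n_classes)]
    return float(np.mean(ccr)), float(np.mean(y_pred == y_true))
```

For example, with three images of class 0 and one of class 1, a classifier that
always predicts class 0 reaches an ACA of 0.75 but an MCA of only 0.5, which
shows why MCA is the stricter criterion on imbalanced classes.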


4.1. Introduction of the HEp-2 Cell Datasets

ICPR2014 cell dataset. This dataset contains 13,596 training cell images,
and the test set is reserved by the competition organizers and not published
yet. The cell images are extracted from 83 specimen images captured
by a monochrome high dynamic range cooled microscopy camera fitted on a
microscope with a plan-Apochromat 20x/0.8 objective lens and an LED
illumination source (Lovell et al., 2014). These specimen images have been
automatically segmented by using the DAPI channel and manually annotated
by specialists. Each image belongs to one of the six staining patterns:
Homogeneous, Speckled, Nucleolar, Centromere, Nuclear Membrane and Golgi,
as shown in the top row of Fig. 3.

ICPR2012 cell dataset. It consists of 1,455 cell images extracted from
28 specimens, which are acquired with a fluorescence microscope (40-fold
magnification) coupled with a 50W mercury vapor lamp and with a digital
camera (Foggia and Vento, 2013). The dataset is pre-partitioned into a
training set (721 images) and a test set (734 images). Each image belongs to
one of the six classes: Homogeneous, Coarse Speckled, Nucleolar, Centromere,
Fine Speckled and Cytoplasmic, as shown in the bottom row of Fig. 3.

Comparing the two datasets shows that two of the six classes are
different. Specifically, two sub-categories of ICPR2012 dataset (Fine Speckled
and Coarse Speckled) are merged into one category (Speckled) in ICPR2014
dataset, and two less frequent staining patterns appearing in daily clinical
cases, Golgi and Nuclear Membrane, are introduced in ICPR2014 dataset for
developing more realistic HEp-2 cell classification systems. Moreover,
because the images in the two datasets are captured with different
laboratory settings, a classification system that can be easily
transferred from one dataset to the other will be highly desirable.



ICPR2014 dataset (top row)
Homogeneous   Speckled   Nucleolar   Centromere   Nuclear Membrane   Golgi
2494          2831       2598        2741         2208               724

ICPR2012 dataset (bottom row)
Homogeneous   Coarse Speckled   Nucleolar   Centromere   Fine Speckled   Cytoplasmic
150           109               102         208          94              58

Figure 3: Comparison of HEp-2 cell images of ICPR2014 dataset (top row) and ICPR2012
dataset (bottom row). The number below the name of each cell is the total number of this
kind of cells in the training set of each dataset.

4.2. Experiments of Hyper-parameters Optimization

This experiment demonstrates the importance of properly tuning the
hyper-parameters in the CNN-based system. We categorize the hyper-parameters
into two groups: model-relevant and training-relevant, as listed in Tables 1
and 2.


Layer Number   Layer Type        Hyper-parameter
Layer 1        Convolution       Filter size: 7 x 7
                                 Feature map number: 6
                                 Activation function: hyperbolic tangent
                                 $\phi(x) = 1.7159 \tanh(\frac{2}{3}x)$
Layer 2        Pooling           Pooling region size: 2 x 2
                                 Pooling method: max-pooling
Layer 3        Convolution       Filter size: 4 x 4
                                 Feature map number: 16
                                 Activation function: hyperbolic tangent
                                 $\phi(x) = 1.7159 \tanh(\frac{2}{3}x)$
Layer 4        Pooling           Pooling region size: 3 x 3
                                 Pooling method: max-pooling
Layer 5        Convolution       Filter size: 3 x 3
                                 Feature map number: 32
                                 Activation function: hyperbolic tangent
                                 $\phi(x) = 1.7159 \tanh(\frac{2}{3}x)$
Layer 6        Pooling           Pooling region size: 3 x 3
                                 Pooling method: max-pooling
Layer 7        Full connection   Neurons number: 150
                                 Activation function: hyperbolic tangent
                                 $\phi(x) = 1.7159 \tanh(\frac{2}{3}x)$

Table 1: Model-relevant hyper-parameters obtained

Hyper-parameter              Value
Initial learning rate        0.01
Mini-batch size              113
Momentum coefficient         0.9
Weight decay coefficient     0.0005
Dropout ratio                0

Table 2: Training-relevant hyper-parameters obtained


To tune these hyper-parameters, we randomly partition the 13,596 cell
images of ICPR2014 dataset into three subsets, that is, 64% for training (8701
images), 16% for validation (2175 images), and 20% for test (2720 images).
This partition is utilized by all experiments on ICPR2014 dataset (multiple
partitions could certainly be implemented when the computational resource is
not an issue). Data augmentation is not used when tuning hyper-parameters.
Following Bengio (2012), the parameters are tuned until the error rates of not
only the training set but also the validation set become sufficiently small
and stabilized. The hyper-parameters obtained by this tuning process are
summarized in Tables 1 and 2.
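
A random partition with these proportions can be sketched as follows. This is
an illustrative NumPy sketch only: the function name and the fixed seed are
hypothetical, and whether the original partition was stratified by class is not
stated, so a plain random permutation is assumed.

```python
import numpy as np

def split_indices(n, train_frac=0.64, val_frac=0.16, seed=0):
    """Randomly split range(n) into training, validation and test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train, n_val = int(train_frac * n), int(val_frac * n)
    return (perm[:n_train],
            perm[n_train:n_train + n_val],
            perm[n_train + n_val:])          # the remainder forms the test set

train_idx, val_idx, test_idx = split_indices(13596)
print(len(train_idx), len(val_idx), len(test_idx))  # 8701 2175 2720
```

Truncating the training and validation sizes and assigning the remainder to the
test set reproduces the subset sizes reported above.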

We highlight that training-relevant hyper-parameters can significantly
affect the convergence of the cost function, the learning speed and the
generalization capability of the network. Their impacts are demonstrated via
the learning curves of MCA on training, validation and test sets shown from
Fig. 4 to Fig. 8. In each figure, we focus on one hyper-parameter while the
others are set to their optimal values in Table 2.

Fig. 4(a) indicates that when the learning rate is small, e.g., 0.001, the
learning process is so slow that the MCA of the three sets has not become
stable within 100 epochs. Properly increasing the learning rate effectively
improves learning efficiency, and the MCA becomes stable within 35 epochs, as
shown in Fig. 4(b). At the same time, an over-large learning rate, e.g., 0.1,
destabilizes the learning process and degrades the classification
performance, as shown in Fig. 4(c). Similarly, Figs. 5, 6 and 7 demonstrate
the impacts of mini-batch size, momentum and weight decay, respectively.
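
To make the roles of the learning rate, momentum and weight decay concrete,
the sketch below applies one common form of the momentum update, the one used
by Krizhevsky et al. (2012). Whether our training follows exactly this form
is an assumption, and the quadratic toy loss and the function names are
purely illustrative.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One momentum-SGD update for a single scalar weight.

    v_new = momentum * v - weight_decay * lr * w - lr * grad
    w_new = w + v_new
    """
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_new, v_new


def minimise_quadratic(lr, steps=100, curvature=1.0, w0=1.0):
    """Run sgd_step on the toy loss 0.5 * curvature * w**2; return final |w|."""
    w, v = w0, 0.0
    for _ in range(steps):
        # The gradient of the toy loss at w is curvature * w.
        w, v = sgd_step(w, v, curvature * w, lr=lr)
    return abs(w)


if __name__ == "__main__":
    for lr in (0.001, 0.01):
        print(lr, minimise_quadratic(lr))
```

On this toy problem the smaller learning rate ends further from the minimum
after the same number of steps, which mirrors the slow learning seen in
Fig. 4(a). The toy loss is not meant to reproduce the curves in Figs. 4 to 7.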

The comparison in Fig. 8 shows that the dropout strategy (Hinton et al.,
2012) should be used with caution. When dropout with a ratio of 0.5 (randomly
setting the activations to zero with a probability of 0.5) is applied to the
first fully-connected layer of our CNN system, the learning process becomes
slow and fluctuates on the ICPR2014 cell dataset. Removing dropout yields a
more stable and faster learning process without overfitting on the test set,
as well as better classification performance. This suggests that the neurons
at the first fully-connected layer may have to work together to distinguish
different staining patterns. In light of this, we decide not to employ dropout
when training our network on the ICPR2014 dataset.
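
For reference, dropout on a layer's activations can be sketched as below.
This is a generic illustration of the technique of Hinton et al. (2012), in
which units are zeroed at training time and all units are kept but scaled by
the keep probability at test time. It is an assumption that our
implementation follows this variant; the function names are hypothetical.

```python
import random


def dropout_train(activations, ratio, rng):
    """Zero each activation independently with probability `ratio`."""
    return [0.0 if rng.random() < ratio else a for a in activations]


def dropout_test(activations, ratio):
    """Keep every unit at test time and scale by the keep probability."""
    return [a * (1.0 - ratio) for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    layer7 = [0.5] * 150  # Layer 7 has 150 neurons (Table 1)
    dropped = dropout_train(layer7, 0.5, rng)
    print(sum(1 for a in dropped if a == 0.0), "of 150 activations zeroed")
```

Setting the ratio to 0, as in Table 2, leaves the activations unchanged.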
(a) Learning rate = 0.001 (b) Learning rate = 0.01 (c) Learning rate = 0.1


Figure 4: Demonstration of the impact of learning rate. It shows that an over-small 
learning rate, e.g., 0.001, slows down the learning process, whereas an over-large learning 
rate, e.g., 0.1, destabilizes the learning process and degrades the classification performance. 
A better classification result can be obtained by properly tuning the learning rate, as shown 
in (b). 




(a) Mini-batch size = 11 (b) Mini-batch size = 77
(c) Mini-batch size = 113 (d) Mini-batch size = 791


Figure 5: Demonstration of the impact of mini-batch size. It shows that when mini-batch 
size is unnecessarily small, the learning process becomes bumpy and does not lead to the 
best result. On the other hand, when the mini-batch size is too large, the learning process 
becomes less responsive and the learning efficiency is decreased. 
(a) Momentum coefficient = 0 (b) Momentum coefficient = 0.8
(c) Momentum coefficient = 0.9 (d) Momentum coefficient = 0.97


Figure 6: Demonstration of the impact of momentum. It shows that using momentum 
can well accelerate the learning process. Meanwhile, a large momentum coefficient, e.g., 
0.97, makes the descent direction dominated by the previous ones and causes oscillation 
at the initial stage. Also, it decreases the classification performance at the later stage. 





(a) Weight decay coefficient = 0.00005 (b) Weight decay coefficient = 0.0005
(c) Weight decay coefficient = 0.005


Figure 7: Demonstration of the impact of weight decay. It shows that a smaller weight 
decay coefficient seems to be a safer choice, while a larger coefficient, e.g., 0.005, could 
destabilize the learning process. 
(a) Dropout ratio = 0.5 (b) Dropout ratio = 0

Figure 8: Demonstration of the impact of dropout. It shows that the dropout strategy
should be used with caution. As seen in (a), the learning process becomes slow and
fluctuates on the ICPR2014 cell dataset when dropout is applied. A better learning
process is obtained in (b) after removing dropout.

In sum, among the hyper-parameters of a CNN, the learning rate, mini-batch
size, momentum coefficient, and weight decay coefficient can significantly
impact the network training process. They have to be carefully tuned
before satisfactory classification performance is obtained. For our deep CNN
system, with the hyper-parameters set as in Table 2, we achieve an MCA
of 89.17% on the test set of the ICPR2014 dataset without using data
augmentation.

4.3. Experiments on Data Augmentation

This experiment demonstrates the two points presented in Section 3.3,
which are recapped as follows: i) the performance of the CNN can be greatly
boosted by generating new training images via rotation; ii) the extra samples
generated via such rotation-based augmentation help to enrich our
observations of the staining patterns of each cell category for training the
CNN, and this enrichment contributes more to the improvement of the
classification performance than the increased robustness of the CNN against
the global orientation of cells.

Effectiveness of data augmentation. We augment the training set by
rotating each cell image through 360° with angle steps of 36°, 18° and 9°,
respectively. In this way, the training set is expanded by 10, 20 and 40
times, and each expanded set is used to train the CNNs. To improve the
robustness of our system, we select the four CNNs corresponding to the 75th,
85th, 95th and 100th epochs, after the network learning becomes stable (see
footnote 3), as in Krizhevsky et al. (2012). A test image goes through the
same rotation process as the training images and is jointly classified by the
four CNNs as in Eq. (7). This system is named “CNN”. As shown in the first
row of Table 3, the MCA is significantly improved (by more than 7 percentage
points) from “No data augmentation” to “Augmentation by a rotation angle step
of 36°”. Furthermore, applying a smaller angle step to generate more training
data pushes the MCA even higher, reaching 96.76%. Similar results can be
observed on the ACA values. These consistent improvements demonstrate the
effectiveness of data augmentation for cell image classification.
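
The bookkeeping behind the augmentation and the joint classification can be
sketched as follows. The angle lists reproduce the 10-, 20- and 40-fold
expansion stated above. The combination rule is an assumption: Eq. (7) is not
reproduced in this section, and the sketch simply averages the class
probabilities over all networks and rotated copies. All names are
hypothetical.

```python
def rotation_angles(step_deg):
    """Angles (in degrees) covering a full turn with the given step."""
    if 360 % step_deg != 0:
        raise ValueError("step must divide 360")
    return [k * step_deg for k in range(360 // step_deg)]


def ensemble_predict(prob_lists):
    """Average class-probability vectors and return (label, mean_probs).

    `prob_lists` holds one probability vector per (network, rotated copy)
    pair, e.g. 4 networks x 10 rotations = 40 vectors for a 36-degree step.
    """
    n_classes = len(prob_lists[0])
    mean = [sum(p[c] for p in prob_lists) / len(prob_lists)
            for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean


if __name__ == "__main__":
    for step in (36, 18, 9):
        print(step, len(rotation_angles(step)))  # 10, 20 and 40 copies
```

Each rotated copy of a test image would be passed through each of the four
networks, and the resulting vectors handed to `ensemble_predict`.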


Method      Accuracy        No data        Augmentation by a rotation angle step of
            (on test set)   augmentation   36°       18°       9°

CNN         MCA (%)         88.58          95.99     96.71     96.76
            ACA (%)         89.04          96.51     97.10     97.24

CNN-Align   MCA (%)         88.86          95.13     96.50     96.52
            ACA (%)         88.71          95.33     96.84     96.84

Table 3: Classification accuracy of our deep CNN on ICPR2014 dataset 


Data augmentation vs pre-alignment. To gain more insight into the
rotation-based data augmentation, we pre-align all the cell images with PCA
as described in Section 3.3 to train the CNNs. We call this method
“CNN-Align”. Two experiments are conducted: i) only using these aligned images
to train the CNNs without performing data augmentation; and ii) as a
comparison, further rotating each aligned training image through 360°, also
with angle steps of 36°, 18° and 9°, respectively, and using the augmented
training set for training. As before, augmentation (or no augmentation) is
equally applied to test images.
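
As a rough illustration of PCA-based pre-alignment, the sketch below computes
the orientation of the first principal axis of a set of 2-D points; rotating
the image by the negative of this angle aligns that axis with the horizontal.
Which points are used (for example, the pixel coordinates inside the cell
mask) is an assumption here, since the details are given in Section 3.3 and
not in this section. The function name is hypothetical.

```python
import math


def principal_axis_angle(points):
    """Angle (radians) of the first principal axis of 2-D points.

    Uses the closed form for the leading eigenvector of the 2x2 covariance
    matrix: theta = 0.5 * atan2(2 * s_xy, s_xx - s_yy).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


if __name__ == "__main__":
    diagonal = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    print(math.degrees(principal_axis_angle(diagonal)))
```

The principal axis is only defined up to a 180° flip, so alignment of this
kind removes orientation variance without fixing the direction of the cell.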

As shown in Table 3, when no augmentation is performed, CNN-Align
does not achieve a clear improvement over CNN (88.86% vs 88.58% in MCA, and
88.71% vs 89.04% in ACA). This indicates that pre-alignment does not help
here. In contrast, when the training data are augmented by rotation (even
with the largest angle step of 36°), the performance of CNN-Align improves
significantly. This sharp change indicates that, through the rotation-based
augmentation, the network can access more examples showing the diverse
staining patterns within cell images. This contributes more to the
performance improvement than pre-alignment, which only tackles the global
orientation variance of cells.


3 This strategy is adopted as a form of model averaging. A different number of CNNs
may be chosen, e.g., 3 or 5, to trade off computational expense against performance;
this leads to similar classification accuracy in our experiments.

The features (filters) learned by the first and second convolutional layers
of the CNN corresponding to the 100th epoch, trained with 9° rotated cell
images, are depicted in Fig. 9. It can be seen that the filters of the first
convolutional layer are stain-like texture detectors. Some of the second
convolutional layer filters are edge-like detectors, but most of them are also
stain-like texture extractors.



(a) 1st convolutional layer features (b) 2nd convolutional layer features 


Figure 9: The features learned by the first and second convolutional layers. In general, 
most of the filters are stain-like texture detectors, and some are edge-like extractors. 
Figure 10: Confusion matrix of our best CNN (9° rotation) (%).


In addition, the confusion matrix of the best CNN (trained with the
rotation angle step of 9°) is shown in Fig. 10. The overall classification
performance is very promising. The staining patterns Nucleolar and Nuclear
Membrane obtain the highest classification accuracy (both 98.87%),
which means that they are well separated from the others. The maximum
misclassification rate (4.85%) occurs for Golgi cells. They are easily
misclassified as Nucleolar cells, because both patterns consist of a few large
dots within the cells. Also, Golgi can be confused with Nuclear Membrane.
This may be because, when the large dots within Golgi cells are at the edge,
they look like Nuclear Membrane cells, which have ring-like edges. In
addition, Speckled cells are easily misclassified as Homogeneous cells,
probably because densely distributed speckles are the main signature of both
patterns. Misclassification examples of these staining patterns are shown in
Fig. 11.
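
Both accuracy measures used in this section can be read off a confusion
matrix of counts. The sketch below assumes the usual contest definitions:
MCA is the mean of the per-class accuracies, and ACA is the overall fraction
of correctly classified images. The function name and the toy matrix are
illustrative.

```python
def mca_and_aca(confusion):
    """Return (MCA, ACA) from a square confusion matrix of counts.

    confusion[i][j] is the number of class-i images predicted as class j.
    """
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    mca = sum(per_class) / len(per_class)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    aca = correct / sum(sum(row) for row in confusion)
    return mca, aca


if __name__ == "__main__":
    # Toy two-class example with unbalanced classes (not data from the paper).
    print(mca_and_aca([[90, 10], [5, 5]]))
```

With unbalanced classes the two measures differ: a small class that is
poorly recognized lowers MCA much more than ACA.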
Figure 11: Misclassification examples of the three highest misclassification rates in the
confusion matrix of Fig. 10. Every two rows form a group, and the first row shows cells
that are misclassified as the cell type of the second row.


4.4. Comparison with the BoF and Fisher Vector Models

Experimental setting. To ensure a fair comparison, the same image
preprocessing as in our CNN model is used in both models. For each cell
image, SIFT descriptors are extracted from densely sampled patches with a
stride of two pixels. The visual dictionary is generated by applying k-means
clustering to the descriptors extracted from training images. Local
soft-assignment coding (LSC) (Van Gemert et al., 2008; Liu et al., 2011) is
employed to encode the SIFT descriptors. SPM is used to partition each
image into 1x1, 2x2 and 1x3 regions, and max-pooling is applied to
extract representations from each region.

A similar setting is applied to the FV model. In addition, the
128-dimensional SIFT descriptors are decorrelated and reduced to 64
dimensions by PCA as in Sanchez et al. (2013). A GMM is then estimated to
represent the visual dictionary. Afterwards, each PCA-reduced SIFT descriptor
is encoded with the improved Fisher encoding (Perronnin et al., 2010), where
the signed square-root and l2-normalization are applied to the coding vector.
SPM with four regions (1x1 and 1x3) is adopted (Sanchez et al., 2013).
Following the literature, a multi-class linear SVM classifier is used in the
BoF and FV models. In our implementation of BoF and FV, the publicly
available VLFeat toolbox (Vedaldi and Fulkerson, 2010) is used.

Parameter setting. There are two primary parameters in the BoF
and FV models: patch size and dictionary size (or equivalently, the number of
components of the GMM in the FV model). We tune these parameters by
five-fold cross-validation on the union of the training and validation sets,
with MCA as the criterion. The candidate patch sizes are 9x9, 11x11, 13x13,
15x15 and 20x20, while the candidate dictionary sizes are 1,000, 2,000,
3,000, 4,000, 5,000 and 10,000. For FV, the number of Gaussian components is
chosen from 64, 128, 256, 512 and 1024. Through the cross-validation,
the patch size and the dictionary size in the BoF model are selected as 15x15
and 10,000. With the use of SPM, this results in an 80,000-dimensional
representation for each cell image. For the FV model, the patch size is
chosen as 20x20 and the number of GMM components is 512. With the use
of SPM, this leads to a 262,144-dimensional representation for each image.
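
The two representation sizes follow directly from the settings above, as the
short calculation below shows. It assumes the standard Fisher vector layout
of one mean gradient and one variance gradient per descriptor dimension and
GMM component; the function names are illustrative.

```python
def spm_regions(grids):
    """Total number of SPM regions for a list of (rows, cols) grids."""
    return sum(rows * cols for rows, cols in grids)


def bof_dimension(dictionary_size, grids):
    """One pooled code of length `dictionary_size` per SPM region."""
    return dictionary_size * spm_regions(grids)


def fv_dimension(descriptor_dim, n_components, grids):
    """Mean and variance gradients (factor 2) per dimension and component."""
    return 2 * descriptor_dim * n_components * spm_regions(grids)


if __name__ == "__main__":
    # BoF: 10,000 words over 1x1 + 2x2 + 1x3 = 8 regions.
    print(bof_dimension(10000, [(1, 1), (2, 2), (1, 3)]))  # 80000
    # FV: 64-D descriptors, 512 components, 1x1 + 1x3 = 4 regions.
    print(fv_dimension(64, 512, [(1, 1), (1, 3)]))  # 262144
```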

Comparison results. The BoF, FV and CNN models are compared on
the same training and test sets. Both cases, i.e., with and without
data augmentation, are investigated. To be fair, when data augmentation is
used, the visual dictionary in the BoF and FV models is built with the
augmented training set. Also, to be consistent with the setting of our deep
CNN system, each test image in this case is equally augmented and its
label is predicted in a way similar to Eq. (7), except that the probabilities
are replaced by the decision values of the linear SVM classifier.

As shown in Table 4, FV is consistently better than BoF, regardless of
whether data augmentation is applied or not. This agrees well with the
literature. Furthermore, both BoF and FV benefit from data augmentation,
with an average performance increase of about 4 percentage points.
Compared with BoF and FV, the CNN system shows slightly lower performance
(88.58% vs 89.83% for BoF and 91.60% for FV) when there is no augmentation.
However, the CNN outperforms both BoF and FV once data augmentation
is applied. Specifically, the highest MCA, 96.76%, is obtained by our CNN,
while BoF and FV achieve at most 94.23% and 95.73%, respectively. A similar
situation can be observed from the ACA values. These results suggest that i)
when training samples are not sufficient, the high-capacity CNN is more
difficult to train than the shallower, hand-designed models such as BoF and
FV; and ii) by properly using data augmentation to generate more training
data, the CNN can be better trained and is able to achieve better performance
than the BoF and FV models.


Accuracy        Methods   No data        Augmentation by a rotation angle step of
(on test set)             augmentation   36°       18°       9°

MCA (%)         BoF       89.83          94.23     93.98     94.14
                FV        91.60          95.41     95.73     95.53
                CNN       88.58          95.99     96.71     96.76

ACA (%)         BoF       90.70          94.30     94.19     94.38
                FV        92.65          95.78     96.07     95.81
                CNN       89.04          96.51     97.10     97.24


Table 4: Comparison of classification accuracy among the methods of BoF, FV and our
deep CNN on ICPR2014 dataset


4.5. Experiments on the Adaptability across Datasets

As previously mentioned, HEp-2 cell image classification varies with
laboratory settings, the types of staining patterns involved, and the size of
the dataset. Such differences can be clearly seen between the ICPR2014 and
ICPR2012 datasets. As a result, it is highly desirable that a cell
classification system trained with one dataset can be conveniently adapted to
another one. Having this property not only improves the efficiency of system
building, but also allows full advantage to be taken of the image data in
different datasets. To demonstrate this property for our CNN-based system, we
compare a CNN trained purely on the ICPR2012 dataset (called CNN-Standard for
short) with a CNN that is pre-trained on the ICPR2014 dataset and then
adapted to the ICPR2012 dataset (called CNN-Finetuning).

Following the previous experimental settings, CNN-Standard is trained with
the 721 training images predefined in the ICPR2012 dataset. Only the green
channel of each image is kept and the same preprocessing as in Section 3.2
is performed. The dropout strategy (with a ratio of 0.5) is used, because it
benefits network training and classification performance on this small
dataset. CNN-Standard is trained for 100 epochs and then used to classify
the predefined test images by following Eq. (7).

To train CNN-Finetuning, we first select a basic CNN system learned
with the ICPR2014 dataset. It is the one obtained at the 100th epoch when
the system is trained with an augmented (rotation with an angle step of 9°)
training set of ICPR2014. Afterwards, this basic system is fine-tuned with
the training set of the ICPR2012 dataset, with or without data augmentation.
All the trainable network parameters of all layers are updated during
this fine-tuning process. To demonstrate the efficiency, we fine-tune this
basic system for only 10 epochs, which takes significantly less time than the
100 epochs spent in training CNN-Standard.
Figure 12: The MCA of test set obtained by CNN-Finetuning at each of the 10 epochs.
Data augmentation with various angle steps is investigated.


The evolution of the MCA on the test set over the 10 epochs is plotted in
Fig. 12. As shown by the line of “No rotation”, CNN-Finetuning does not
work well at the beginning. Nevertheless, it catches up quickly within a
couple of epochs and reaches a satisfying performance in 10 epochs.
Furthermore, the adaptation stage is significantly shortened by applying data
augmentation to the small training set of ICPR2012 to increase the number of
training samples. These results demonstrate how efficiently our CNN-based
system can be adapted, especially considering that two of the staining
pattern classes differ between these datasets. A comparison of CNN-Standard
and CNN-Finetuning is shown in Table 5. It is interesting to note that
CNN-Finetuning consistently outperforms CNN-Standard, even though it is only
fine-tuned for a few epochs. We attribute its superiority to the good
initialization of the network obtained from the training process on the
ICPR2014 dataset. Based on the above results, we believe that our CNN-based
system will be a better option for practical applications.


Accuracy        Methods          No data        Augmentation by a rotation angle step of
(on test set)                    augmentation   36°       18°       9°

MCA (%)         CNN-Standard     63.1           72.4      72.4      73.2
                CNN-Finetuning   74.5           76.3      76.2      74.9

ACA (%)         CNN-Standard     64.3           70.2      70.0      70.1
                CNN-Finetuning   72.9           74.8      74.7      73.3


Table 5: Classification accuracy of our CNN-based system on ICPR2012 dataset 


Finally, we compare our CNN-Finetuning (rotation with an angle step
of 36°) with other methods reported in the literature in Table 6. As seen,
it outperforms both the best-performing method and the CNN of the ICPR2012
contest. For that CNN, a 100 x 100 pixel area of the
green channel centered at the largest connected component of each cell is
taken via the mask and is then normalized by mapping the first and 99th
percentile values to 0 and 1. The architecture of that CNN is composed
of two sequences of convolution, absolute value rectification and subtractive
normalization, one average pooling layer, one max pooling layer and one fully
connected layer (see footnote 4), which is quite different from our
architecture. The better performance of our CNN may benefit from these
differences as well as from our effective data augmentation. Also, our
CNN-Finetuning is only slightly inferior to the method in Theodorakopoulos et
al. (2014b). That method combines two kinds of hand-crafted features, the
distribution of SIFT and gradient-oriented co-occurrence LBP, and creates a
dissimilarity representation of an image from them.


4 Please refer to the contest report available at
http://mivia.unisa.it/hep2contest/HEp2-Contest_Report.pdf for the detailed
presentation of the contest CNN.
Method                                                         Average classification accuracy (ACA)

2012 contest best-performing method (Foggia and Vento, 2013)   68.7%
2012 contest CNN (Foggia and Vento, 2013)                      59.8%
Nosaka and Fukui (2014)                                        68.5%
Shen et al. (2014)                                             74.4%
Faraki et al. (2014)                                           70.2%
Larsen et al. (2014)                                           71.5%
Theodorakopoulos et al. (2014b)                                75.1%
Our CNN-Finetuning                                             74.8%


Table 6: Comparison with other methods on the ICPR2012 dataset 


4.6. Discussion on Computational Issues

For the CNN-based classification system, training the network is the most
time-consuming step in the whole pipeline. However, this process can be
considerably accelerated by GPU programming. Also, as previously shown, an
existing CNN-based system can be efficiently transferred to a new but related
task via a short training process. Once the networks are trained, a test cell
image only needs to go through the four networks and is classified
within 1.2 seconds in total with a Matlab implementation on a computer with
a 3.30 GHz Intel CPU and 16 GB RAM.

For the BoF and FV models, building the visual dictionary or the GMM
is computationally intensive, especially when there are a large number of
training images, e.g., due to the use of data augmentation. For example,
building a dictionary of 10,000 visual words and a GMM of 512 components
takes more than 4 days and 2 days, respectively, in our implementation, when
the training set of the ICPR2014 dataset is augmented by rotation with an
angle step of 9°. Also, a large dictionary in the BoF model could slow down
the encoding process, e.g., around 78 seconds per image in our experiment.
Although the time for this process can be reduced in the FV model, it still
takes about three seconds per image. In addition, SPM is usually needed to
attain better classification performance. In this case, the dimensions of the
resulting image representation are much higher than that in the CNN-based
system (80,000 or 262,144 vs only 150).

In addition, it is worth mentioning that in the ICPR2014 contest (Lovell et
al., 2014), the three methods that perform better than or comparably to our
deep CNN system (87.10%, 83.64% and 83.33% vs 83.23% with the MCA
criterion) are all built on two-stage frameworks: hand-designed feature
representation and classification. The top-ranked method utilizes multi-scale
and multiple types of local descriptors (Manivannan et al., 2014); the
second-ranked method adopts a hand-crafted rotation-invariant dense scale
local descriptor (Gragnaniello et al., 2014); and the third method combines
morphological features and different local texture features (Theodorakopoulos
et al., 2014a). In contrast, our CNN system generates discriminative features
from raw pixels directly by utilizing class label information and jointly
learns the classifier in a single architecture, without learning extra
dictionaries as these methods do.


5. Conclusion 


This paper proposes an automatic HEp-2 cell staining pattern
classification framework based on deep convolutional neural networks. We give
a detailed description of various aspects of this framework and carefully
discuss a number of key issues that could affect its classification
performance. An extensive experimental study on two benchmark datasets
demonstrates i) the advantages of our framework over the well-established
image classification models on cell image classification; ii) the importance
and effectiveness of data augmentation, especially when training images are
not sufficient; iii) the desirable adaptability of our CNN-based system
across different datasets, which makes our system attractive for practical
tasks. Much future work can be done to further improve the performance of the
proposed system. In particular, a super-CNN trained with a large-scale
generic image benchmark, ImageNet (Deng et al., 2010), has recently prevailed
on many generic visual recognition tasks. We would like to explore the
effectiveness of the features generated by this CNN for HEp-2 cell images and
the adaptation of this CNN to cell image classification. These issues will be
of significance considering the substantial differences between generic
images and HEp-2 cell images.